The Mechanism of Additive Composition
Additive composition (Foltz et al., 1998; Landauer and Dumais, 1997; Mitchell
and Lapata, 2010) is a widely used method for computing meanings of phrases,
which takes the average of vector representations of the constituent words. In
this article, we prove an upper bound for the bias of additive composition,
which is the first theoretical analysis on compositional frameworks from a
machine learning point of view. The bound is written in terms of collocation
strength; we prove that the more exclusively two successive words tend to occur
together, the more accurately their additive composition is guaranteed to
approximate the natural phrase vector. Our proof relies on properties of
natural language data that are empirically verified, and can be theoretically
derived from an assumption that the data is generated from a Hierarchical
Pitman-Yor Process. The theory endorses additive composition as a reasonable
operation for calculating meanings of phrases, and suggests ways to improve
additive compositionality, including: transforming entries of distributional
word vectors by a function that meets a specific condition, constructing a
novel type of vector representations to make additive composition sensitive to
word order, and utilizing singular value decomposition to train word vectors.Comment: More explanations on theory and additional experiments added.
Accepted by Machine Learning Journa
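The composition operation the abstract analyzes, averaging the distributional vectors of the constituent words, can be sketched minimally as follows (the word vectors here are invented toy values, not trained embeddings):

```python
import numpy as np

# Toy distributional word vectors, invented purely for illustration.
word_vec = {
    "machine":  np.array([0.8, 0.1, 0.3]),
    "learning": np.array([0.2, 0.9, 0.4]),
}

def additive_composition(words):
    """Approximate a phrase vector by the elementwise average of the
    vectors of its constituent words."""
    return np.mean([word_vec[w] for w in words], axis=0)

phrase = additive_composition(["machine", "learning"])
# phrase is the average of the two word vectors: [0.5, 0.5, 0.35]
```

The theoretical result quoted above bounds how far this averaged vector can lie from the "natural" phrase vector one would estimate directly from corpus contexts of the phrase, with the bound tightening as the two words collocate more exclusively.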
Mixture of Expert/Imitator Networks: Scalable Semi-supervised Learning Framework
The current success of deep neural networks (DNNs) in an increasingly broad
range of tasks involving artificial intelligence strongly depends on the
quality and quantity of labeled training data. In general, the scarcity of
labeled data, which is often observed in many natural language processing
tasks, is one of the most important issues to be addressed. Semi-supervised
learning (SSL) is a promising approach to overcoming this issue by
incorporating a large amount of unlabeled data. In this paper, we propose a
novel scalable method of SSL for text classification tasks. The unique property
of our method, Mixture of Expert/Imitator Networks, is that imitator networks
learn to "imitate" the estimated label distribution of the expert network over
the unlabeled data, which potentially provides a useful set of features for
classification. Our experiments demonstrate that the proposed method
consistently improves the performance of several types of baseline DNNs. We
also demonstrate that our method exhibits the "more data, better performance"
property, with promising scalability to the amount of unlabeled data.
Comment: Accepted by AAAI 201
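The core imitation step described above, imitator networks matching the expert's estimated label distribution on unlabeled data, is commonly formalized as minimizing a KL divergence against the expert's soft predictions. The paper's exact objective may differ; this is one plausible sketch of such a loss, with all function names chosen here for illustration:

```python
import numpy as np

def softmax(z):
    """Convert logits to a probability distribution (numerically stable)."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def imitation_loss(imitator_logits, expert_logits):
    """KL(expert || imitator): the imitator is trained to reproduce the
    expert network's estimated label distribution on unlabeled inputs.
    The expert's outputs act as fixed soft targets."""
    p = softmax(expert_logits)    # expert's soft labels
    q = softmax(imitator_logits)  # imitator's current predictions
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean()

# Identical logits give zero loss; the loss grows as the imitator's
# distribution diverges from the expert's.
same = imitation_loss(np.array([[1.0, 0.0]]), np.array([[1.0, 0.0]]))
diff = imitation_loss(np.array([[0.0, 1.0]]), np.array([[1.0, 0.0]]))
```

Because the targets come from the expert rather than from gold labels, this loss can be evaluated on arbitrarily large pools of unlabeled text, which is what gives the framework its scalability.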
Selective Sampling for Example-based Word Sense Disambiguation
This paper proposes an efficient example sampling method for example-based
word sense disambiguation systems. To construct a database of practical size, a
considerable overhead for manual sense disambiguation (overhead for
supervision) is required. In addition, the time complexity of searching a
large-sized database poses a considerable problem (overhead for search). To
counter these problems, our method selectively samples a smaller-sized
effective subset from a given example set for use in word sense disambiguation.
Our method is characterized by the reliance on the notion of training utility:
the degree to which each example is informative for future example sampling
when used for the training of the system. The system progressively collects
examples by selecting those with greatest utility. The paper reports the
effectiveness of our method through experiments on about one thousand
sentences. Compared to experiments with other example sampling methods, our
method reduced both the overhead for supervision and the overhead for search,
without degrading the performance of the system.
Comment: 25 pages, 14 Postscript figures
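The selection loop described above, greedily collecting the examples with greatest training utility, can be sketched as follows. The paper's utility notion (informativeness for future sampling) is more involved than what is shown here; this sketch substitutes a simple entropy-based uncertainty score as a hypothetical stand-in:

```python
import numpy as np

def utility(probs):
    """A hypothetical training-utility score: entropy of the system's
    current sense distribution for each example, so examples the system
    is most uncertain about score highest."""
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def select_examples(candidate_probs, budget):
    """Greedily pick the `budget` candidate examples with greatest
    utility; only these are manually sense-tagged and stored, reducing
    both supervision and search overhead."""
    scores = utility(candidate_probs)
    return np.argsort(scores)[::-1][:budget]

# Three candidate examples with two candidate senses each; the most
# ambiguous one (index 1) is selected first.
probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]])
chosen = select_examples(probs, budget=1)
```

Keeping only high-utility examples shrinks the database, which addresses both overheads at once: fewer examples to annotate and a smaller index to search at disambiguation time.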